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Abstract. Stencil computations on low dimensional grids are kernels of 
many scientific applications including finite difference methods used to 
solve partial differential equations. On typical modern computer archi- 
tectures such stencil computations are limited by the performance of the 
memory subsystem, namely by the bandwidth between main memory 
and the cache. This work considers the computation of star stencils, like 
the 5-point and 7-point stencil, in the external memory model. The anal- 
ysis focuses on the constant of the leading term of the non-compulsory 
I/Os. Optimizing stencil computations is an active field of research, but 
so far, there has been a significant gap between the lower bounds and the 
performance of the algorithms. In two dimensions, matching constants 
for lower and upper bounds are provided closing a gap of 4. In three di- 
mensions, the bounds match up to a factor of v2 improving the known 
results by 2\/^V~B, where B is the block (cache line) size of the external 
memory model. For higher dimensions n, the presented lower bounds 
improve the previously known by a factor between 4 and 6 leaving a gap 
of « f . 

1 Introduction 

Solving Partial Differential Equations (PDEs) is one of the most common tasks 
in scientific computing. A standard way to discretize low dimensional Euclidean 
spaces for these computations are regular grids. Applying a finite difference 
method on this discretization turns a differential operator into a linear func- 
tion of a grid point and its neighbors. Consider, for example, the linear ap- 
proximation of the Laplacian on a regular two dimensional grid as given by 
Au(x, y) = ^[u((x-h), y)+u((x + h), y)+u(x,y-h)+u(x,y + h)-4u(x, y)]. 
This defines the so called 5-point or 1-star stencil and turns the differential equa- 
tion at hand into a very regular sparse system of linear equations. To make use 
of the sparsity, such systems are typically solved with iterative solvers like the 
Jacobi or Gauss-Seidel method. The kernel of these methods is the evaluation 
of the underlying stencil, making it the most performance critical component of 
the computation. 

One characteristic of the stencil computation is that it performs relatively 
few floating point operations on the data. Frequently, the theoretically available 
peak floating point performance cannot be achieved because the memory system 
is the bottleneck limiting the speed. Over the last decades the performance of the 



memory systems did not increase at the rate of the CPU performance. Hence, 
optimizing the memory access became the main focus when designing high per- 
formance code in these situations. This focus is reflected in the computational 
models used to analyze algorithms and complexities of computational tasks. In 
this paper we work with the I/O-model that reflects two arbitrary levels of the 
memory hierarchy. In this model the complexity is given by the number of I/O 
operations, counting the number of cache misses or respectively the number of 
page faults. The model has the parameters M (main memory size) and B (block 
size). M describes the number of variables fitting into the memory in which 
computations are performed and B the size of a block (in variables) in which 
the disk is organized. These parameters reflect the two levels of the memory 
hierarchy that the model focuses on. 

Stencil operations are not only easy in the sense that there is hardly any 
choice in the floating point operations. The majority of the I/O operations is 
already needed for reading the input and writing the output. In fact, many simple 
algorithms for the 5-point stencil are within a factor of 5 of this lower bound 
and the classical asymptotic analysis is to coarse to give interesting insights. 
I/O operations related to the initial read of the input and the final write of 
the output are called compulsory I/Os or cold cache misses. All other I/Os are 
called non- compulsory I/Os or capacity misses (because they are unnecessary for 
sufficiently large M) . The analysis of stencil operations carried out in this paper 
gives almost matching bounds in the sense that it focuses on the constant of the 
leading term in the asymptotics of the non-compulsory I/Os. 

Here, the lower bounds did not only provide part of the complexity results, 
but actually guided the construction of the data layouts and algorithms improv- 
ing the upper bounds. What remains open is, on one hand, an experimental 
consideration how to turn the proposed algorithms into high performance code, 
and, on the other hand, to find matching bounds for situations with non-trivial B 
and standard layouts like a usual row or column layout. For the latter problem, 
we expect that the lower bound can be strengthened. 

1.1 Problem Definition 

The computational model is an I/O Model similar to [T] and [2]. There are two 
levels of memory, an external memory of infinite size on which all data is stored 
initially and an internal memory of size M to which the data has to be loaded 
to perform computations. The external memory is organized in blocks of size 
B. We classify the I/Os into cold misses or compulsory I/Os, which account for 
the first access to an element and writing the final output, and capacity misses 
or non- compulsory I/Os which are caused by the limited size of the internal 
memory. 

The computation graphs consist of two layers, an input and an output layer, 
of either the n dimensional grid or torus, [k] abbreviates {1, . . . , k}. [ki] x [fej] x 
... x [k n ] denotes the n-dimensional grid and Z/^ x Zfe 2 x ... x Zfe n the n- 
dimensional torus of side lengths ki. Two vertices x and y are joined if and only 
if ||ir — y\\\ = 1 where the component- wise computations are done in Z for the 
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grid and in for the i.th component of the torus. Denote with V the vertices 
of either the grid or torus. 

The stencils considered are called star stencils. In n dimensions the s-star 
stencil S s with center x is defined as S s (x) := {y e G : \\y — x\\i < s}. The 
most common star stencil is the 1-star stencil. Upper (lower) bounds for the 
s-star stencil induce upper (lower) bounds for all stencils which are subsets 
(supersets) of the s-star stencil. Hence meaningful choices also include s = 2 
and s = 3. Therefore s is assumed to be a small constant in our discussion. 
The computation graph for S s consists of two copies of V, Vi n := V x {in} and 
V ut '■= V x {out}, the vertices of which are connected in the following way: 
E := {{v in , v out ) e V in x V out : \\v in - < s). The norm is computed as if 

both vertices would belong to V. The computation graph is then {Vi n UV outl E). 
Denote by x in and x out the corresponding vertices of the input and output layer. 

Evaluating S s at y ou t € V ou t requires that all vertices to which y out is con- 
nected, namely S s (yi n ), arc in internal memory, i.e. we do not allow partial 
computations. The goal is to evaluate S s for all vertices of V out and write the 
results to external memory. 



1.2 Results 

This work examines the leading term of the non-compulsory I/Os of the s-point 
stencil. In two dimensions, matching lower and upper bounds are given closing a 
multiplicative gap of 4. In three dimensions the provided bounds match up to a 
factor of \/2 improving the known results by a factor of 2\fZ\[B. For dimensions 
bigger than three, the lower bounds are improved between a factor of 4 and 6 
leaving a gap of n ~\fn\ w — for higher dimensions n. 

We assume that the grid sides are ordered, k\ > ... > k n , and significantly 
larger than the internal memory, namely fci, k 2 > 2nM+M+l and fcj > 2nM+l 
for i S {3, . . . , n}. The asymptotics are considered for fcj — > oo, M — > oo and 
B — >• oo while assuming — >• oo and ^ — ^ oo. Denote by C s {k\, . . . , k n ) the 
number of I/Os to evaluate the s-point stencil on [fci] x . . . x [k n ]. Then the 
following holds (assuming n and s are constant): 
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The bounds consist of three parts. The first part is the constant 2 accounting for 
the compulsory I/Os. The second part is the leading term of the non-compulsory 
I/Os on which this work focuses. The third part characterizes lower order terms 
that we do not explore further. 

Both lower bounds and algorithms can be transferred to parallel external 
memory (PEM) as introduced in [3], as long as the number P of processors is 
smaller than jj Y\7=i ^i- I n this case, the complexities are reduced by a factor 
of P. Unlike with classical computational complexity (i.e. on a PRAM), there 
cannot be a general simulation of a parallel algorithm on a single processor that is 
only P times slower (it is possible to make use of the combined internal memory 
of size P ■ M) . Still, the lower bounds in this paper work in the parallel setting 
just as well: One round of the parallel computations, as defined by a certain 
number of non-compulsory I/Os, cannot evaluate more stencils than its serial 
counterpart. Regarding the algorithms, the most important observation is that 
all evaluations of stencils are independent from each other and could in principle 
be done in parallel. For example the extreme situation of one processor per grid 
point would work with as many parallel I/Os as there are points in the stencil, 
plus one I/O for writing the result, assuming that the model allows multiple reads 
and writes to the same block of external memory. This algorithm is obviously 
optimal, but it has many more non-compulsory I/Os than the algorithms we 
focus on. For moderate P we use the proposed serial algorithms and merely split 
the computation into P contiguous parts. The only additional non-compulsory 
I/Os are used to initially fill the local memory. Assuming P < jj Y17= 1 this 
is a lower order term, namely the one that we analyze as the difference between 
the torus and the grid. 

1.3 Related Work 

In pQ the external memory or I/O model was introduced by Hong and Kung for 
B = 1 . Using essentially an isopcrimctric argument they apply it to problems like 
the Fast-Fourier- Transform (FFT), matrix-matrix multiplication and products 
of graphs. The latter yields the first bounds for the non-compulsory I/Os of the 

1-star stencil ^ ..-y^ ■ 11"= l • Aggarwal and Vitter generalized Hong and 
Kungs model in [5] to arbitrary B. Using a counting argument they derive lower 
bounds for the problems of calculating the FFT, permuting, sorting and matrix 
transposition. In [4 Savage simplifies Hong and Kungs technique to the S'-span 
approach and in |5 Bezrukov surveys edge isoperimetric problems and shows 
how they can be used to determine the communication complexity in networks. 

The I/O complexity of the 1-star stencil has been discussed further indepen- 
dently by Frumkin and Wijngaart [E0 and Leopold |8|9)10|11] . The different 
results for the leading term of the non-compulsory I/Os are given in Table [l] and 
have to be multiplied by the number of vertices Jl"=i Frumkin and Wijngaart 
consider arbitrary dimensions but focus on the asymptotic behavior of the non- 
compulsory I/Os. The lower bound uses an isoperimetric argument similar to 
the one presented in this article but does not exploit its full strength. The best 
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Table 1. Comparison of the bounds for the leading term of the non-compulsory I/Os 
for the 1-star stencil. All to be multiplied with the number of grid points =1 ki . 





Presented Result 


Frumkin and Wijngaart 


Leopold 


Lower Bound 2D 
Lower Bound 3D 

Low. Bnd. Arbitrary D 


4 
BM 
8 1 
V3 BVM 
4. 2 1 /<™- 1 ).(„_l) i 


8 1 

9 BM 
2 1 

V3 BVM 

12\^T=i n 1 


2 

BM 
2 

Bv'M 

n.a. 


n ~fyn>. B "~\/M 


V 3 / ™-^/(n-l)!S n ~V~M 


Upper Bound 2D 
Upper Bound 3D 

Upp. Bnd. Arbitrary D 


4 
BM 

SV2 1 


°w 

°(-w) 


8 

BM 
4^6 


BVM 

4-2^(71-1) 

v ' B n VM 


n.a. 



lower bounds are given in [B]. We improve these results by a factor between 4 
and 6. The upper bound focuses on the asymptotic behavior and is an existence 
results. The best upper bound is stated in [7]. Leopold focuses on the two and 
three dimensional cases. Her lower bounds exploit a weak isoperimetric result 
|8I10] which we improve by a factor of 2 and ^ for two respective three dimen- 
sions. The upper bounds discuss row and column layouts. By using a data layout 
suited for our algorithms we decrease the the upper bounds by \ and -4= for 

z 3v 3 

two and three dimensions. Leopold also discusses two spatial and one temporal 
dimension in [3], which is out of the scope of this paper. 

There is vast ongoing research about optimizing stencil computations, mostly 
in two and three dimensions, on modern computer architectures. Since the gap 
between peak floating point performance and memory bandwidth is growing 
since years or even decades, this research focuses on improving the I/O behav- 
ior of the algorithms. When implementing one may need to be careful about 
the tradeoff between a more complicated data layout making a sophisticated 
padding scheme necessary and (theoretically) optimal I/O behavior. Addressing 
these problems is out of the scope of this work. However, diagonal hyperspace 
cuts, similar to the ones proven optimal in this work, are often employed in 
empirical work to select suitable substructures for computation. The literature 
includes work on compiler optimization |12ll3j , data layouts determined by space 
filling curves [2] and wavefront optimization [T3]. In the cache oblivious model 
asymptotic upper bounds are derived in [16J which are then shown to be achieved 
in |17)18|19| . A recent survey of the field is [2D]- 

2 The Lower Bounds 

The lower bound is derived by splitting an arbitrary algorithm into rounds of 
an equal number of non-compulsory I/Os. The work which can be done in each 
of these rounds is then bounded by an isoperimetric inequality. This yields the 
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minimum number of rounds which have to be performed by any algorithm. Mul- 
tiplying this with the number of non-compulsory I/Os that define a round yields 
the lower bound. The lower bound is first deduced assuming that an I/O oper- 
ation accesses one element (B = 1) and is then generalized for arbitrary B. 

2.1 The Isoperimetric Inequality 

The isoperimetric inequality states how many vertices can be enclosed by a 
fixed number of boundary vertices. The optimal sets in this sense are called 
isoperimetric sets and as proven by Bollobas and Leader in |21j the isoperimetric 
sets in Z£ are (fractional) t 1 ballsj^To state this result precisely we introduce 
some notation, mainly from |21j : 

A fractional system or simply system is a function from Z£ or Z™ to the unit 
interval [0, 1]. For / : Z™ — > [0, 1] the function can take non-zero values only for 
a finite number of grid points. The weight w of a system / is w(f) = X^ez™ f( x ) 
or w(f) = Xzez™ f( x ) according to the domain of /. A fractional system / on 
ZJJ or Z n is therefore a generalization of a subset 5 of Z£ or Z™ respectively. If a 
fractional systems / takes just the values and 1, then / is naturally identified 
with the set S = / (1) and the weight w(f) is the cardinality of S. The closure 
df of a system / is given by 



is called the fractional i 1 ball by ' a ' of radius r S No, < r < §, surplus 
a £ (0, 1) and center y £ Z£. For < v < k n we also use the notation by which 
describes the unique ball of weight v and center y. The following theorem, which 
states that balls have the smallest closure of all systems of the same weight, was 
proven by Bollobas and Leader: 

Theorem 1 (|21j - Theorem 4: An isoperimetric inequality on the dis- 
crete torus). Let k > 2 and even, let f be a fractional system on Z£. Then 



We need a version of this theorem which allows us to bound the number of inte- 
rior vertices given the number of boundary vertices. This differs in two aspects 
from the above theorem: First, we want to look at the inner-boundary of a set 
and not at its closure and second we need to have a result for systems of all 

1 It is well known that the isoperimetric sets in the continuous domains R" are £ 2 




The fractional set 




w (df) >w(db w W). 



balls. 
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weights but bounded inner-boundary. This will make it necessary to translate 
Theorem [T] to the infinite grid Z n where the boundary of the balls is growing 
strictly monotonic. To derive the desired result, we need more notation. 
Similar to the closure of / we define 



0, if f(x) < 1 

min {/(2/) : d(x,y) = 1}, if f(x) = 1 



the inner core of /, and generalize this to the inner- s-core by applying the oper- 
ator repeatedly, A s f(s) = A . . .A f(x). This is now used to defined the inner- s- 

s times 

boundary by r s f(x) = f(x) — A s f(x) . This allows to establish the first version 
of the isoperimetric inequality, stating that balls have the largest inner-core of 
all systems of the same weight. 

Lemma 1 (A version of the isoperimetric inequality). 

Let k > 2 be even, let f be a fractional system on ZJJ and s G N. Then 

w(r s f)>w(r s b w ^) 

which is by definition equivalent to 

w(AJ) < w (A s b w ^ . 

To prove the lemma we need an additional fact. 
Lemma 2. For a fractional system f on ZjJ the following inequality holds: 

d{A{f{x)) < f(x) . 



Proof. This is seen by examining the different cases carefully. If f(x) = it 
follows that d(Af)(x) = as well since all neighbors of x are set to by the 
Z\-operator. When < f(x) < 1, Af(x) = and for all y such that \\x — y\\ = 1 
we have Af(y) < f{x) and hence d(Af)(x) < f(x). When f(x) = 1 the claim 
holds trivially. □ 

Proof (of Lemma^. The claim is proven by induction over s. First, consider 
the case s = 1. If w(f) < 1 then w(rf) = w(f) and w(Af) — such that the 
claim holds. Assume there exists some fractional system / with w(f) > 1 such 
that 

w(rf) < w (rb w{f ^ and hence w(Af) > w ^Ab wU ^ . 
By the latter and the strict monotonicity of w we get 

w (db w ^) > w (db w ( Abm(f) )) . 
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To simplify the right hand side we use that the inner core of a ball is itself a ball 
and hence we can discard building the ball of it. 

w(db w ( AbWif) ))=w(dAb w W) . 

For a ball with w(f) > 1 the closure of the inner core is pointwise equal to the 
ball itself. Furthermore we employ Lemma [2] 

10 (dAb w{f ^ = w(b w ^) = w(f) > w(dAf) . 

Reading this sequence of inequalities altogether yields 

w (db w( < Af A > w(dAf) , 

which contradicts Theorem[l]for A(f) as fractional system and proves the claim 
for s = 1. 

Let us now prove the claim for s assuming it holds for s — 1. Using the 
induction assumption for Af we arrive at 

w(A s f) = w(A s ^Af) < w (A-i^^) . 

Noting that b' , Ab' and Z\ s _i&' are pointwise monotonically increasing yields that 
w(A s -ib') is monotonically increasing. Hence we can apply the result proven for 
s = 1 to yield 

w (A^VM) < w (z\ s _ 1 ^(^ </> )) . 

But the inner core of a ball is a ball itself, so we can discard building the ball of 
it and this simplifies to the required result 

□ 

Since the weight of the inner- s-core of a ball is monotonically increasing with 
the weight of the ball, this result can be used to deduce the implication 

(w(f) <v)=>( w(AJ) < w(A s b v ) ) . 

Nevertheless, we run into problems when bounding the weight of a ball given that 
its inner-boundary is bounded. The inner-boundary of balls is only monotonically 
increasing until v 4r- and thereafter monotonically decreasing. To overcome 
this problem, we transfer the results to the infinite grid, where the inner-s- 
boundary of balls is monotonically increasing with respect to the weight of the 
ball. 

Theorem 2 (The boundary bounds the core on Z n ). 

Let seff and f be a fractional system on Z n . For v G the following holds: 

( w(r 2s f) < w(r 2s b v ) ) =► ( w(Aj) < w(A s b v ) ) . (i) 
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Proof. The proof is split into two parts. ( w(/2s/) < w(r2 S b v ) ) =>■ ( u>(/) < v ) 
is first proven by contraposition. Hence, we first prove 

( w(r 2s f) > w(r 2s b v ) ) ( w(f) > v ) . 

Since / takes just a finite number of non-zero values, we can find k such that 
all non-zero values of / are in the grid {— k, . . . , k} n and we can embed / in the 
torus %2(k+i) sucn tnat no points 01 / touch were the grid is closed to a torus. 
Hence, the inner- s-boundary on Z" is exactly the same as on ^(k+i)- ® n * n ' s 
torus Lemma [l] yields w(r s f) > w(r s b w (^). To conclude the first part, note that 
the weight w(r s b v ) is strictly monotonically increasing with respect to v on Z™. 
This yields 

w(rj) > w(r s b w ^) > w(r s b v ) 

and since s was arbitrary also w(/2 S /) > w(r2 S b v ) which establishes the first 
part. 

Employing Lemma[l]again on the torus and noting that w(A s b v ) is monoton- 
ically increasing with respect to v yields (w(f) < v) ( w(A s f) < w{A s b v ) ) 
and the proof is complete. □ 



2.2 The Size of the I 1 Ball and its Boundary 

In this section we derive the asymptotic expansion for the number of vertices of 
a ball and its inner-boundary in Z™ with respect to the radius r, 

w ( & (r < 0) ) = — ■r n +0(r n - 1 ) and w ( r&< r '°> J - , ^-r^+Oir' 1 - 2 ) . 

\ J n\ V / (n — 1)! 

(2) 

As long as the sides of the torus or grid are big enough, k > 2(r+l), the formulas 
apply there also. Note that all lower order terms have positive coefficients. We 
derive these formulas by recursing over the dimensions. Hence it is useful to 

(t 0) i 

introduce the notation 6„ ' for the ball of radius r in n dimensions. The £ ball 
of dimension n consists of smaller balls of one dimension less, namely the level 
sets in the new dimension: 

1=0 

Another simple fact is bn' ^ = b„ 1 ' ^ + rbn' ^ which yields when combined 
with ([3| 

rtf' 0) - t-l + ■ (4) 

Since w (rb { n fi) ^ = 1 for all n £ N and w (rb[ r,0) ^) = 2 for r > 1 the weight of 
the one dimensional balls is given by 

4 r ' 0) = 2r + 1 . (5) 
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Recursion ^ yields that w (bn'°^ and w (_T&^' M are polynomials in r of 
degree n and n — 1 with non-negative coefficients. So they can be written as 



W ( fo i r ' 0) ) = E < • f< &nd ™ ( rfe n ' 0) ) = XX < ' ^ • 
i=0 i=Q 

Examining the leading term a„. „ of to f&rT' ^J yields 

7 — 1 r— 1 n— 1 



2=0 i=o i=o 

n-l 



= 0(0+2^ «n-l,iX Z< < 

i=o V ;=o / 



i=0 
n-l 



= £)(r"- 1 ) + 2V(«»_ 1 ,,^-] =2a„_i,„_i— + G(r 

Comparing the coefficient of the leading terms yields the recursion 

_ 2 

^n,n — ™n— 1, n— 1 ■ 

n 

The recursion stops with ([5|, namely ai. i = 2. Hence we get 

2™ . ,.,.„,, 2 n 



n-l\ 



Q 



n, n 



and w(6i r ' 0) ) = -r • r n + 0(r n ^) 



Now Q yields 



on on 

/3n, n-i = t rry and hence w(rb^) = - • r^ 1 + 0(r n - 2 ) 

(n — 1)1 [n — 1)1 



2.3 Pathwidth 

We introduce and employ pathwidth [22] to ensure that we are working on the 
"inside" of the torus and can treat it like the infinite grid which allows to use 
the isoperimetric results. Furthermore, the pathwidth of the grid ensures that 
non-compulsory I/Os are performed when the s-star stencil is evaluated. This 
allows to split an arbitrary algorithm into rounds of a certain number of non- 
compulsory I/Os. 

Definition 1 (Pathwidth [22] ). A path decomposition of a graph G = (V,E) 
is a sequence of subsets of vertices (Xi,X%, ... ,X r ), called bags, such that 

1- Ul<i< r Xi = V. 
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2. for all edges (v, w) G E there exists an i £ {1, ... , r} such that x S Xi and 
v e 

3. for all k such that l<i<j<k<rit holds that Xi n Xp. C Xj. 

The width of a path decomposition {X±, Xi-, ■ ■ ■ , X r ) is maxi<Kr \Xi \ — 1. The 
width of a graph G is the minimum width over all possible path decompositions 
ofG. 

Condition (3) implies that a vertex can only be in a consecutive block of bags 
and not reappear after it has been removed from a bag once. 

An algorithm evaluating the s-star stencil on G without non-compulsory I/Os 
and internal memory of size M implies that pathwidth(G) < M — 1. If we can 
evaluate the s-star stencil on G with internal memory of size M and without 
loading a vertex twice this immediately gives a path decomposition with bags of 
size at most M. The bags are the different sets of elements the internal memory 
is containing at different stages of the algorithm. Since the two dimensional grid 
[kx] x [fo] nas pathwidth min{fci, ki} (Corollary 89 of [22]) there have to be non- 
compulsory I/Os if we want to evaluate the s-star stencil on a two dimensional 
grid or torus with min{/ci, ki\ > M 

Pathwidth can also be modeled by a robber and cop game [2'6\ . This game 
is played on an arbitrary undirected graph like the grid or the torus. Initially p 
cops are placed on the vertices of the graph and afterwards the robber chooses 
its initial position. The robber is visible to the cops during the game and the 
game proceeds in rounds. First the cops announce were they want to be placed in 
the next round. Then every cop that wants to move boards a helicopter. While 
the cops are moving in the air the robber is allowed to move to an arbitrary 
vertex of the graph if he can reach it without running into a cop. Thereafter the 
cops land and the robber escapes in that round if no cop lands on the vertex the 
robber is standing on. The game then continues with the next round. If there is 
a strategy so that the robber is able to escape the cops for an infinite number of 
rounds we say that the robber wins. 

The following implication is proven in [23 : When the robber cop game is 
played with p cops on a graph G and if there is a strategy so that the robber 
wins, G has to have pathwidth bigger than p — 1. 

Lemma 3. If the subgraph H of a two dimensional grid or torus consists of 
p + 1 complete rows and complete columns, then pathwidth(H) > p. 

Proof. We give a strategy in the robber and cop game such that the robber 
wins against p cops to prove the claim. Since there are p + 1 complete rows and 
columns in H, the robber is free to start in a row which is empty after the initial 
placements of the cops. When the cops announce their move, there will be a free 
column in the next configuration. Since the robber is in a free row, it can move 
to this column which is free in the next configuration. The game now proceeds 
with rows and columns interchanged. The robber always escapes from a free row 
to a free column and vice versa. □ 
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2.4 Splitting an Arbitrary Algorithm into Rounds 

To derive the lower bounds assume an arbitrary algorithm evaluating the s-star 
stencil on Z^ x . . . x Zfc n is given. When min{fci, k 2 } > M the pathwidth of 
x . . . xZfc n is at least M and hence the algorithm causes non-compulsory I/O 
operations. We can count these operations and split the algorithm into rounds 
of c non-compulsory I/Os. c is denoted round length and hence all rounds except 
the last one cause c non-compulsory I/Os. The isoperimetric inequality allows to 
bound the work per round which yields a lower bound for the number of rounds. 
Multiplying the number of rounds with the number of non-compulsory I/Os c 
per round establishes the lower bound. This approach is similiar to the idea 
presented by Hong and Kung in 1 and therefore we call the rounds Hong-Kung 
rounds. 

To apply the isoperimetric inequality we need to establish a link between the 
inner-core and inner-boundary and the rounds. Assume an arbitrary algorithm is 
split into its Hong-Kung rounds of length c and one round is picked. Denote with 
S the set of vertices which are in internal memory at some point of this round. 
Let Transfer(S) be the transfer vertices of S. These are all vertices of S which 
are also available in other rounds. Eval(S), the evaluated vertices, are all vertices 
of S for which the s-point stencil is evaluated in the current round. The following 
two observations relate these sets to the inner-core and the inner-boundary. 

r 2s (S) C Transfer(S) and Eval(S) C A S (S) . (6) 

A vertex can only be evaluated in round S if all its neighbors within distance 
s are in S as well. A S (S) consists of exactly these vertices. Equivalently r s (S) 
are the vertices which cannot be evaluated in round S. Take any x £ r s (S). All 
vertices which are within distance s from x need to be in the round in which x is 
evaluated. Hence they need to be transferred. The set of all vertices of S within 
distance s from any of the vertices of r s (S) is /^(S). Therefore these vertices 
are a subset of the transfer vertices. 

Furthermore, we can give an upper bound for the transfer vertices per round 
and hence for the inner-2s-boundary vertices. At the beginning of a round at 
most M vertices are in internal memory. These vertices have also been in in- 
ternal memory at the end of the previous round and hence they are transfer 
vertices. The M vertices which are in internal memory at the end of the round 
are available to the next round so they are transferred as well. During one round 
an unlimited number of compulsory I/Os but only c non-compulsory I/Os occur. 
Assume a vertex does not cause a non-compulsory I/O and, since this case has 
already been taken into consideration, is not in internal memory at the beginning 
or end of that round. This vertex has then not been loaded before and hence 
it has not taken part in any of the previous rounds. Furthermore, since it does 
not cause a non-compulsory I/O it is also not written back to external memory 
and hence not available to any succeeding round. In conclusion, at most c addi- 
tional vertices, associated with the non-compulsory I/Os, are transfer vertices. 
Altogether, there are at most 2M + c transfer vertices per round, 

w( Transfer{S)) < 2M + c. (7) 
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Since all vertices of S \ Transfer(S) are not available for other rounds their s- 
star stencil has to be computed in the current round. Furthermore, since the 
vertices of S \ Transfer(S) do not cause non-compulsory I/Os they are loaded 
into internal memory only during the current round. Hence their pathwidth has 
to be less or equal to M — 1, 



2.5 Deducing the Lower Bound 

This section addresses the following issues: An arbitrary algorithm evaluating 
the s-star stencil on the Zk 1 x ■ • • x Zfc„ is split into its Hong-Kung rounds of 
c non-compulsory I/Os. Limited pathwidth of S \ Transfer(S) ensures that we 
can handle the torus like an infinite grid. This allows to apply the isoperimetric 
inequality to upper bound the vertices which can be evaluated per round and 
hence lower bound the number of rounds that are performed. This yields a lower 
bound for the number of non-compulsory I/Os on the torus which we finally 
transfer to the grid. 

So, assume an arbitrary algorithm evaluating the s-star stencil on TL^ x ■ ■ • x 
is given. Assuming fci, k% > M it has to perform non-compulsory I/Os and 
hence we can split the algorithm into Hong-Kung rounds of c non-compulsory 
I/Os each. The last round may have less non-compulsory I/Os. Fix one of the 
rounds and denote the vertices which are in internal memory in this round by S. 
We know w (Trans fer(S)) < 2M + c and pathwidth(S \ Transfer(S)) < M - 1. 

Assuming k x ,k 2 > 2M + c + (M + 1) and k t > 2M + c + 1 for i € {3, . . . , n} 
the vertices of (at least) M + 1 hyperplanes of normal x\, M + 1 hyperplanes 
of normal x 2 and one hyperplane of normal Xi (3 < i < n) do not belong to 
Transfer(S) . These hyperplanes form a connected component in Zfc t x • • • x 
So they could either be a subset of S \ Transfer(S) or disjoint from S. Taking 
the intersection of all hyperplanes results in a subset of a two dimensional torus 
of at least M + 1 complete rows and columns. Hence, by Lemma [3] we have 
pathwidth(S \ Transfer(S)) > M which contradicts Therefore, there is at 
least one hyperplane for each normal direction Xi (1 < i < n) which is disjoint 
from S. Deleting these hyperplanes results in a graph which can be embedded 
in the infinite grid Z n . 

Combining ^ and Q we get w(r2 S S) < 2M + c. Denote with vq the weight 



By Theorem [2] and ([6]) the number of evaluable vertices of S is bounded by 
w(Eval(S)) < w(A s b v °). This establishes the framework for the lower bound, 
namely the lower bound is given by 



pathwidth (S \ Transfer(S)) < M — 1 . 



(8) 



such that 



w(r 2s b V0 ) = 2M + c. 



(9) 



n 



C 



(10) 



w(A s b v °) 
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It remains to determine vq which in turn depends upon c. Since s is assumed to 
be small and constant we simplify Q before solving. Denote (r , «o) the radius 
and surplus such that b v ° — &( r o> Q °T Using ^ the asymptotics of w(r2 S b v °) is 
given by 

(r 2s b Vo ) = w (rb^ ro ' ao A + w (r 2s _i6( ro '°)) + w fr&fo-P"- 1 ^-^))) = 

2s — 2 y n 

(r + l)"" 1 + O (r»- 2 ) + £ t^Tv (ro " ^ + ° (r °" 2) + 

u — n V /* 

2" 



2" 



a • 



(n-l)f 

+ (1 - «o) • T^Y), (r - (2s - l))"" 1 + O (r - 2 ) 



fe=0 

\Tl-l 



= 2s • 



(n-1)! 



(11) 



Since all coefficients in the lower order terms are non-negative, dropping the 
lower order terms before solving (J9j) increases r and vq, increases w(A s b v °) 
and hence weakens the lower bound. Solving (111 without the lower order terms 
yields 



''o 



(n-1) 



! . 



2M 



2s ■ 2 n 



(12) 



Plugging that into (10 1 for the lower bound and maximizing over c (setting the 
derivative to and checking that the solution is a maximum) yields that the 
best round length, disregarding lower order terms, is approximately 

c=2(n- 1) • M. 

Using this round length, we determine an upper bound ro for the radius of a 
ball to be handled in one round using ( 12 1 



ro 



n\ M 
2™ ~~s 



2(n- 1)M 



and finally, by plugging this radius into ( 10 1 , the lower bound 
2(n-l)M _ 2{n-\)M ^ l 

n fc 



W (A s b( r «'V) w (bUro-*)' )) 

2{n-\)M 
2 ("-!) TT fc 

(^ s «) 1 /("- 1 ) M l/(n-l) +0(1) 1 



2(n- 1)M 
.(^ S n ) 1/(n_1) Mi/(»-i) 
1 



4(n- 1) 



(9 



-0 



1 



1 
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This bound was derived on the torus Zfe x x • • • x Zfc n and we can apply it to 
the grid [k\] x • • • x [k n ] using a reduction. Let bn' °^ denote the t 1 ball of radius 
r and surplus a in n dimensions. 

Lemma 4. Any algorithm using internal memory of size M and evaluating the 
s-point stencil on the grid [k\\ x • • • x [k„] induces an algorithm, using the internal 
memory M and evaluating the s-point stencil, on the torus Zj^ x • • • xZk n causing 

at most O (jj^i kij additional I/Os. 

Proof. When the algorithm for the grid is evaluated on the torus, only the ver- 
tices close the boundary of the grid have to be treated differently. If a vertex is 
within I 1 distance s — 1 in a unit direction from a bounding hypcrplanc, at most 
half of the points of the s-point stencil, corresponding to that unit direction, 
have to be read and written additionally for this vertex. Altogether these are at 

most 2n ■ s ■ ^ . J]^ 1 h = O I/Os. □ 

Hence the lower bound on the torus and grid of the same side lengths differ at 
most by an additive constant of O {jXiZi k^j. For arbitrary B one I/O opera- 
tion affects at most B elements, such that the total number of I/Os (including 
compulsory ones) for the grid is 

3 The Upper Bounds 

The algorithms evaluating the s-point stencil depend on the layout in which the 
grid data is saved. In two dimensions, for example, the lower bound suggests to 
work in adjacent I 1 balls where neighboring balls form a diagonal working band. 
Although, for B = 1 or a data layout supporting this data access, the diagonal 
sweep through the data provides a matching upper bound, it is not optimal for 
many other data layouts. 

All the bounds have in common that a sweep shape is moved through the 
grid in unit shifts in a sweep order resulting in working bands. The sweep shapes 
are line segments in two dimensions and subsets of planes in three dimensions. 
To achieve good results, the data layout has to reproduce the shape of the sweep 
shapes, and to evaluate all vertices of the grid, the working bands have to overlap 
dividing the working band into core and wing bands. For optimal asymptotic 
behavior vertices in different bands have to be saved in separate blocks. 

The description of the data layouts is given indirectly by the core and wing 
bands of the algorithm. For each subset of working bands that have overlapping 
wing bands, the data is save in separate blocks. Within the bands the data is 
ordered in the natural way the algorithm accesses it. Instead of already having 
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the data separated by core and wing bands the algorithm can change the data 
layout while it is executed. 

Having the preceding s and the proceeding s sweep shapes in internal memory 
allows to evaluate the s-star stencils of the current sweep shape. Special attention 
has to be paid to evaluate the vertices which are in the relative boundary of the 
sweep shape. The relative boundary is the boundary of the sweep shape within its 
line or plane and these vertices are evaluated by overlapping the working bands. 
In fact only 2s instead of 2s + 1 full sweep shapes have to be kept in internal 
memory, as the vertices of the s.th preceding sweep shape can be deleted or 
written to the external memory as vertices of the s.th proceeding sweep shape 
are loaded. Only a constant number of vertices, depending on s but independent 
of the size of the sweep shape, are necessary in addition to 2s layers of sweep 
shapes. 

When discussing the upper bounds we focus on the leading term of the non- 
compulsory I/Os and disregard the constant number of compulsory I/Os. The 
exact analysis of the lower order terms is tedious work and presented separately 
in Section [4] The detailed analysis keeps repeating the same steps for every 
algorithm and is mainly bookkeeping of the different lower order terms. 

3.1 Upper Bounds in 2 Dimensions 

The upper bounds in two dimensions are summarized in Table [2] When the sweep 
shape is shifted orthogonal to the alignment of the blocks (block aligned column 
layout) the I/O behavior is asymptotically optimal. However, the constant at 
the leading term of the non-compulsory I/Os matches the lower bound only for 
the block aligned diagonal layout. 

The main results for two dimensional row and column layouts have been pre- 
sented by Leopold in [8 . In a row (column) layout blocks extend in xj-direction 
(^-direction) . Sweeping through a row layout in xi -direction (^-direction) is 
equivalent to sweeping in a column layout in ^-direction (xi-direction). Hence 
only sweeps in xi-direction will be discussed. 



Table 2. Upper bounds for different data layouts in two dimensions 



Layout 


Sweep Shape 


Sweep Order 


Upper Bound 


Row 


Vertical 


Xl 


If • fafca 


Block Aligned Column 


Vertical 


Xl 


Wm ' k\k-i 


Block Aligned Diagonal 


Diagonal 


Xl, Xl 


S3? ' fclfca 



Row Layout: The sweep shape is a vertical line segment of m points shifted 
in X\ direction. The middle m — 2s vertices of a sweep shape can be evaluated 
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when 2s sweep shapes are in internal memory. For B > 2s one block has to be 
in internal memory per row of the sweep shape (mB < M). In the worst case of 
a block aligned layout the 2s previous sweep shapes are also needed in internal 
memory which adds an error term of O ( ^jjy ) (see ( 17 I in Sect . |Z| . Disregarding 
this, the size of the sweep shape is roughly m = Mj. The s uppermost and s 
lowermost rows of a working band cannot be evaluated which adds an error term 
of O ( ) ( see (15), (16)). This error term also accounts for reserving blocks 
for the output. Assuming the whole working band can be evaluated, there are 
about working bands. The 2s uppermost and 2s lowermost rows of a working 
band overlap with the according rows of neighboring working bands and hence 
they cause non-compulsory I/Os. When evaluating the working bands bottom 
up, the lowermost (uppermost) rows cause non-compulsory reads (writes). Each 
B columns there is one such I/O. Altogether, the number of non-compulsory 
I/Os is roughly • 2s • 2 = || ■ k\k 2 . There is an additional error term of 

0{k,2) (see (14l) since the first and last block of each row, including the core 
bands, may be loaded twice. This bound is a factor of B worse than the best 
upper bound since, per row of a working band, a whole block instead of only 2s 
vertices reside in internal memory. Please see Sect. [4] for a detailed analysis of 
the lower order terms and an explanation what they capture. 



Block Aligned Column Layout: In a block aligned column layout the blocks 
extend in ^-direction and start at the same x 2 values for all columns (see Fig. [I] 
- middle). The sweep shape is a vertical line segment of m vertices. To evaluate 
a sweep shape, the s preceding and s proceeding sweep shapes and an additional 
s + 1 vertices of the s.th proceeding sweep shape have to be in internal memory. 
So we have roughly m = ¥- and hence ^ working bands. 

When the vertices of core and wing bands are not saved in separate blocks, 
but assuming B > 2s and the working band size is chosen such that the wing 
bands are totally within one block (otherwise multiply the non-compulsory I/Os 
by 2) one I/O occurs for every column at the top and bottom end of the working 
band. Hence the non-compulsory I/Os are ^ • k\ ■ 2 = j§k±k 2 . Since there are 
two I/Os every column, each writing a full block instead of the necessary 2s 
elements, this bound is a factor of B of the best upper bound. 

Now consider saving the vertices that are part of overlapping working bands, 
defining the wing bands, in separate blocks. Each wing band contain 2s vertices 
per column. So there are 2 I/Os 2 every ^ columns and the upper bound is ^ - 2- 
= MI3 ■ This analysis estimates the leading term of the non-compulsory 
I/Os correctly. Please refer to Sect. [1] for the detailed analysis including and 
explaining the lower order terms. The upper bound, including the compulsory 
I/Os, is 

8s 2 . „„ / 1 1 \\ k x k 2 



O 



M \M 2 k 2 J J B 

This layout achieves the correct asymptotics but the leading term of the non- 
compulsory I/Os is by a factor 2 worse than the lower bound. In [5] Leopold uses 
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a mixed row / column layout which achieves the same constant at the leading term 
of the non-compulsory I/Os. 




Fig. 1. The row (left), block aligned column (middle) and block aligned diagonal layout 
(right) in two dimensions for s = 1. Current 2s + 1 sweep shapes red. 



Block Aligned Diagonal Layout: The lower bound suggests to work in ad- 
jacent l x balls. Within a ball we work in diagonals of direction (1, —1) which we 
shift alternately in both unit directions. These diagonal sweep shapes consist of 
m vertices. There is a close connection between the sweep shape and the path- 
width and the minimum bisection separator of the i 1 ball as graph. The internal 
memory must hold 2 s sweep shapes, hence m = The working bands and 
their overlap again define wing and core bands and the data is saved in separate 
blocks for each of these bands (see Fig. [I]- right). Within a band the data is or- 
dered by sweep shapes the algorithm accesses successively. Per row, a core band 
has 2m — 4s and a wing band 2s vertices. Compared to the column layout the 
width of a core band has doubled. Hence the upper bound, matched by the lower 
bound and including the compulsory I/Os and the lower order terms derived in 
Sect. [4] is 



3.2 Upper Bounds in 3 Dimensions 

The three dimensional results are presented in Table [3] Working orthogonal to 
the alignment of the blocks (block aligned column/pole layout) achieves optimal 
asymptotic behavior. Using an two dimensional i 1 ball instead of a square as 
sweep shape improves the leading term of the non-compulsory I/Os by a factor 
of ^ (2D block aligned diagonal layout). The best new upper bound improves 

this by another factor of leaving a gap of \/2 to the lower bound (hexag- 
onal aligned diagonal layout). The algorithm shifts a hexagonal sweep shape, 
resulting from the intersection of a three dimensional i 1 ball with a plane of nor- 
mal (1, 1, 1) through the origin, and alternates the three unit shifts (hexagonal 
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aligned diagonal layout). Refer to Section [1] for the derivations of the lower order 
terms. 



Table 3. Upper bounds for different data layouts in three dimensions 



Layout 


Sweep Shape 


Sweep Order 


Upper Bound 


Row 

Block Aligned Column/Pole 
2D Block Aligned Diagonal 
Hexagonal Aligned Diagonal 


Square 
Square 
I 1 ball 
Hexagonal 


Xl 
Xl 
Xl 

Xl, X 2 , X3 


• k ^ 

^ ■ k ^ 



In three dimensions up to four working bands overlap. Per sweep shape, the 
number of these overlapping vertices solely depends on s and is independent of 
the size of the sweep shape. Therefore this effect is captured in the the lower 
order terms and hence we can simplify the analysis such that every vertex in 
a wing band causes one non-compulsory write and one non-compulsory read 
operation. 

Poles are the equivalents to rows (xi-direction) and columns (^-direction) 
in rE 3 -direction. As in the two dimensional case, we consider only sweeps in in- 
direction but all layouts. 



Row Layout: As sweep shape we utilize a square with m vertices per side. 
Assuming B > 2s one block contains all necessary 2s sweep shapes and the size 

of the sweep shape is about m = \p^j- A working band has m vertices in x% and 
also in x 2 direction and is surrounded by 4 • m ■ 2s wing vertices per a^a^-plane 
which each cause one I/O every B a^a^-planes. Hence, the leading term of the 



non-compulsory I/Os is 
presented in Sect . [4] yields 



2 + 8s 



■k\k2k^ .A detailed analysis like 




kik 2 k 3 
B 



(13) 



The first error term describes that not the full wing band can be evaluated and 
hence the estimation of the number of working bands contributes an error term. 
If the layout is block aligned, m blocks as well as the 2s precious sweep shapes 
need to be in memory which is captured by the second error term. 



Block Aligned Column and Pole Layout: When sweeping in x± direction 
the column and the pole layout are the same with the role of the coordinates 
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interchanged. In both layouts the sweep shape is a square of m vertices per side 
and the blocks are aligned with the sweep shape. Hence we can choose m = ^/ff ■ 
A working band has m vertices in X3 and also in x 2 direction and a working band 
is surrounded by 4 • m ■ 2s wing vertices per a^a^-plane. Since these vertices are 
sorted according to X2X3-planes each B vertices there is an I/O. Altogether this 
yields ^^h^^ = ^0^- ■ kik 2 h . This layout outperforms the row layout 

by a factor of ^= and is asymptotically optimal. Since we assumed B > 2s for 
the row layout, the block aligned column or pole layout is always advantageous. 

The dominating error term is O ( J • klk ^ ka . The error term is due to the 
layout separating wing and core bands of input and output into different blocks. 



2D Block Aligned Diagonal Layout: The two dimensional diagonal layout 
also sweeps in :ri -direction but takes advantage of the better relative interior to 
relative boundary ratio of the two dimensional I 1 ball compared to this value for 
a square. The sweep shape is a two dimensional I 1 ball which lies in an X2IE3- 
plane. Therefore, the data layout in each a^a^-plane is a two dimensional block 
aligned diagonal layout. 

By ([2| an I 1 ball of m vertices per side (radius m — 1) consists of about 2m 2 
vertices of which 4m are boundary vertices. Hence the size of the sweep shape 

can be chosen as m = \\J 2s boundary layers of the two dimensional i 1 ball, 
about 8ms vertices, are part of the wing bands which are saved separately as 
the algorithm proceeds. Each B vertices cause one I/O. When the grid is tiled 
with the working bands the offset of the working bands is m in ^-direction but 
2m in a^-direction (see Fig. |2j). Hence the upper bound is ^ ^ • k\ ■ = 

j0= ■ kik 2 k 3 . The leading error term O (^^f^j • klk £ ki is due to separating the 
layout into separate blocks for wing and core bands. 



Hexagonal Aligned Diagonal Layout: Imitating the lower bound by tiling 
the grid with l Y balls fails in three dimensions. Nevertheless, a diagonal sweep 
is again advantageous. As sweep shape the intersection of the I 1 ball b^ m '°\o) 
with a plane through the origin and normal (1, 1, 1) is employed (see Fig. [2]). 

This hexagonal sweep shape consists of 2 • f2fc=m+i ^) + (^ TO + •*-) = 3m 2 + 
3m+ 1 vertices. In the same way the diagonal line sweep shape in two dimensions 
has been obtained and this shape is also closely related to pathwidth and the 
minimum bisection separator of the i 1 ball. We shift the sweep shape alternately 
in the three unit directions Xx, x 2 and x% to get the working band. Because of 
the hexagonal structure of the sweep shape we can tile the grid with the working 
bands. 

Figure [2] shows the intersection of a working band with an xiX2-p\a.ne. The 
intersection of the same working band with a different xiX2-plane is obtained by 
shifting the intersection by (1, 1, 1). The intersection consists of 3-(3m 2 +3m+l) 
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vertices. To determine which vertices of these belong to wing bands regard how 
the shifts of the sweep shape affect the s-star stencil. 

Consider the intersection of a working band with an x\x 2 and a point x 
in this intersection in its two dimensional representation. Having the points 
P a (x) = {y e [fci] x [k 2 ] : \\x-y\\i < s} U {y : \\x - j/H^ < s A s, < A ^ < 
(1 < i < n)} U {y : \\x - yW^ < s A x t > A y % > (1 < i < n)} in the 
intersection of a working band with an :Eiizi2-plane allows to evaluate x during 
the sweep. We call P s the two dimensional projection of the s-star stencil (see 
Fig. [2j. We apply P s to the intersection of a working band with a xiX2-plane 
to determine which of these vertices are part of wing bands. This yields 24ms 
vertices per working band and xia^-plane. 

The data layout saves the data grouped according to core bands and wing 
bands shared by different working bands. Within the working bands the vertices 
are ordered in increasing distance to (1,1,1), i.e. they are grouped by sweep 
shapes. Within a sweep shape the vertices are ordered with increasing z- and 
finally with increasing y-coordinate. Using this data layout just, a constant num- 
ber of vertices, depending only on s, are needed in internal memory in addition 
to the 2s sweep shapes. Therefore every B vertices an I/O is done. The size of 

a sweep shape can be chosen as m — y|f ■ It is left to determine the number 
of working bands. Dividing k\k 2 by the number of vertices of the intersection 
of a working band with a rry-plane estimates the number of working bands. In- 
complete working bands at the boundary of the grid can be disregarded as these 
contribute only lower order terms. Hence the upper bound is given by 



See Sect. HI for the analysis including the lower order terms. The leading error 
term is Oy^pj ■ klk ^ k3 and is due to reserving separate blocks for wing and 
core bands. 

3.3 An Upper Bound for Arbitrary Dimensions 

This section gives a simple upper bound for arbitrary dimensions using a block 
aligned column layout. Sweeping through an n dimensional grid with an n — 1 
dimensional hypercube lying in a hyperplane of normal x\ and sweeping it in 
x\ direction yields an easy (asymptotically matching) upper bound. We assume 
that the data is ordered according to core bands and wing bands that belong 
to different intersections of working bands. One side of the hypercube is then 
bounded by 



An (n — 1) dimensional hypercube has 2n faces such that 2(n — 1) faces of the 
working band are surrounded by wing bands. The wing bands have thickness 
2s. The intersection of one wing band shared by two working bands with a 



k x k 2 2Ams k^h 8V2s 3/2 
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Fig. 2. Left: Tiling a 2D grid with ^balls; Middle left: Intersecting a 3D I 1 ball with 
plane of normal (1, 1, 1); Middle right: Projection of 1-star stencil Pi; Right: Inter- 
section of a hexagonal working band with an xy-plane. Vertices that can be evaluated 
blue and wing bands green for s = 1. 



hypcrplane of normal x n consists of a n — 2 dimensional hypercube of side length 
m. There is a working band every m planes for the first n — 1 directions. This 
yields an upper bound of about 



/n-l 



yr k\ 2(n - 1) • 2s ■ m™" 2 • k n _ 4s(n - 1) yr 
11 ™ ) R ~ ~~wR 11 kl 



in B mB 

v i=l / i=l 



= 4.2— . s — ( n - 1) 
The leading error term is 

LOLiA 

and rises from reserving whole blocks for every wing band that is shared by 
different working bands. The quotient of the leading term of the non-compulsory 
I/Os between this upper bound and the lower bound is n ~\fn\ k, - for large 
dimensions. 




4 Detailed Analysis of the Upper Bounds 

The exact analysis of the lower order terms presented in this section is tedious 
work. The detailed analysis keeps repeating the same steps for every algorithm 
and is mainly bookkeeping of the different lower order terms. 

For the detailed analysis of the upper bounds we assume that the data is given 
in the layout separated into core bands and wing bands such that no reordering 
of the data is necessary. Additionally, for each subset of working bands that share 
wing bands, the data is also saved in separate blocks. The number of working 
bands that are adjacent to each other is at most 2". This and also the number 
of subsets of overlapping working bands solely depends on n. 
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We make intense use of the assumptions jf —> oo for i £ {1, ... ,n} and 
^ — > oo and regard everything that grows slower than the leading term of 
the non-compulsory I/Os as lower order terms. In particular terms that solely 
depend on n or s are regarded constant. 

To account for all the I/Os that could occur during the algorithm we first 
derive a more complicated formula for the upper bound taking into account the 
difference between working band and the vertices that can be evaluated in a 
working band, overlap of more than two working bands, blocking the working 
band into several parts belonging to different intersections of wing bands, reserv- 
ing blocks for the output, etc. In this first estimation we are sometimes overly 
pessimistic to simplify the formula for the upper bound. This formula is then 
simplified step by step while we keep book of all error terms that are dropped. 
When an error term is calculated, we summarize in parentheses what the error 
term accounts for. 

When simplifying denominator of upper bounds we use the following obser- 
vation: 

11a 

= - + — • 

m — a to m z — ma 

As the output is written in whole blocks we have to save space in internal 
memory to gather the results of every separately blocked intersection of wing 
bands. As discussed this number is constant and so, for some constant c, M — cB 
spots in internal memory are left to load the input data. 



4.1 Analysis of the Lower Order Terms of the Two Dimensional 
Row Layout 

This section details the analysis of the two dimensional row layout first discussed 



in Sect. 3.1 To ensure that only a constant number of output blocks is needed, 
the data has be be written back in block aligned column layout. Otherwise an 
output block would be needed for every row of the sweep shape making the 
number of output blocks dependent on the size of the sweep shape. This change 
in layout is only needed for the two and three dimensional row layout, as only 
there vertices that are evaluated consecutively are saved in different blocks. This 
reordering ensures that we need only c = 3 output blocks. 

Now consider the input. For B > 2s one block has to be in internal memory 
per row of the sweep shape. In the worst case the layout is block aligned so that 
the 2s previous sweep shapes are also needed. Since we do not need to write the 
vertices of the core band we reorder them in internal memory so that at most 
2s ■ (m — 4s) have to stay in internal memory. The 2s • 4s vertices of the wing 
bands are needed to evaluate other working bands. To preserve the data layout 
the whole block containing the 2s vertices is kept in memory. This gives 



m-B 



2s ■ (to - 4s) 
B 



1 \B + 4s • B < M - cB 
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Capturing in d = c + 2 + 4s the effects of the additional blocks for the wing 
bands and core bands, this solves to 



m 



M -c'B + 8s 2 
B + 2s 



m— 2s 

k- 



estimates the number of working bands. Per row there are at most 

B ] + 1 blocks. Per working band each of these blocks causes an I/O for each of 
the 4s wing rows. Additionally, if the block does not end when the row ends, we 
need another 2m I/Os per working band to write and read the blocks covering 
two different rows. This yields an upper bound of 



fc 2 










^2to + ^ 




+ lj 4s^j 


to — 2s 




B 





which we now simplify step by step collecting lower order terms. Consider the 
second term first. Simplifying the second term to ^ • 4s yields an error term of 
at most 

— ^— ■(2m+2-4s) = 0(fc 2 ) . (14) 
m — 2s 

(If every block spans two rows, there is one additional I/O per row.) Now sim- 



plifying the first term to 



m — 2s 



adds an error of about 



o 



(Contribution of the wing rows of the last (partial) working band.) So we are 
left at 

J*-.*. 4a. 
m-2s B 

By simplifying the denominator to m, we collect another error term of 



k 2 



2s 



to 2 — 2ms B 



-± ■ 4s = 0\ hk 2 



B 



(Core band is smaller than working band .) to arrive at 



k 2 fci 
to ' ~B 



4s 



M-c'B+8s 2 
B+2s 



B + 2s 



h 

c'B + 8s 2 — B — 2s B 



(15) 



4s . 



First simplify the denominator to M, collecting 



fc 2 .|.4s.(5 + 2s).— 



c'B - 8s 2 + B + 2s 



Mc'B + 8s 2 M - MB - 2sM 



o hk 2 



B 
M 2 



(16) 



(Output blocks, work in full blocks.) Now simplify the enumerator to B to arrive 
at 

l. B kl A 43 l. L. 

k *-M-B- 4s =M- klk2 
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collecting the last error term of 



, ki 2s /r .fk 1 k 2 \ M „, 

^■b^ s -m = °{bm) • (17) 

(m blocks and 2s sweep shapes have to in in internal memory.) Let us sum up 
and simplify the error terms: 

This yields the final upper bound of 

4 « , , ^ {, , , B fcifc 2 \ /4s m (\ B 1 \\ k x k 2 
■k 1 k 2 +0[k 2 + k 1 k 2 ^ + -^) = [—+0[— ' 



M \ M 2 BMJ \M \kt M 2 BM B 



A.I Analysis of the Lower Order Terms of the Two Dimensional 
Block Aligned Column Layout 

This section derives the lower order terms that apply to the two dimensional 
block aligned column layout, c = 3 blocks have to reserved for the output of the 
core band and its two wing bands. Of the core band at most 2s sweep shapes and 
s vertices have to be in internal memory. We assume that 2s + 1 wing bands are 
in internal memory. Hence m, the number of vertices in a vertical sweep shape 
is limited by 



2s • (to - 4s) + s 
B 



(2s+ l)2s 
B 



1 }B ■ 2 < M - cB 



This requirement is met, denoting c' = c + 6, for 

M-c'B- 5s 



rn = 



2s 



The number of working bands is 
all wing bands is bounded by 



m — 2s 



and hence the number of vertices in 

fcl • 2s. Since the wings are saved in 

consecutive blocks, each Block is at most accessed twice during the execution of 
the algorithm and hence causes at most one compulsory write and read operation. 
Therefore the number of I/Os is given by 



k 2 
m — 2s 



fcl • 2s 
B 



■2 



Dropping the outer ceiling function adds an error of 2 and dropping the inner 
ceiling function an error term of 



O 
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(Error for the last, incomplete working band.) Dropping the 2s from the denom- 
inator yields an error of 



4s • fci&2 



B m 2 — 2ms 



\BM 2 



(Estimating the number of working bands - not accounting for outer part of wing 
bands.) The upper bound can now be approximated by 



4s • fcifc 2 



2s 



B M- c'B -5s -2s ' 
for which we simplify the denominator to M yielding an error of 



8s 2 • hk 2 
B 



c'B - 7s 



M 2 - c'BM - 7sM 



O 



k x k 2 
~M 2 



(Separate blocks for core and wing bands, blocks for output.) Hence the upper 
bound is 



8s 2 • fcifc 2 ^ /fci fcifc 2 



BM 



B M 2 



8s 2 + O 



M B 
Y 2 + M 



k 1 k 2 
~BM 



4.3 Analysis of the Lower Order Terms of the Two Dimensional 
Block Aligned Diagonal Layout 

For the two dimensional block aligned diagonal layout c = 3 output blocks have 
to be reserved. For the input there have to at most 



2s(m - 2s) + s 
B 



1 \B 



(2s + l)s 
B 



1 2 • B < M - cB 



vertices in internal memory. The first term accounts for the 2s core bands and s 
additional vertices of the 2s + l.th band. The second term contains the vertices of 
both wing bands of width s. 2s + 1 diagonals per wing band are kept in memory. 
Using c' to gather the terms of the ceiling functions and the following +l's we 
get 

M - c'B - 3s 



2s 



The number of working bands per row can be estimated by 
2m 



fci 



+ 1 as 



2m-2s 

2s vertices can be evaluated per row of a working band. For each of the 
two wing bands there are 2s vertices per row for which every B vertices there 
is an I/O. Note that, when the wing bands are ordered bottom up, the upper 
wing band of a working band is the lower wing band of the next working band. 
Hence the number of I/Os can be bounded by 



fci 

2m - 2s 



1 



2sfc 2 
~B~ 



■ 2 



2G 



We are ready to simplify this term. Dropping the outer ceiling functions just 
adds 2 additional I/Os for the last block of the two last wing bands. Dropping 
the inner ceiling function and the corresponding +1 yields an error of O (^f)- 
(Estimation of the number of working bands.) The current upper bound formula 
is 

fci _ 2sfc 2 _ 
2m - 2s ' B 

so we next simplify the denominator to 2m yielding an error term of 



2sk2 1 s = / hk 2 \ 

B ' ' 1 ' 2 ' m 2 - ms V BM 2 / 



(Difference between width of working band and the vertices that can be evaluated 
in a working band.) Now we substitute the value for m to get 

h 2sfc 2 ki sk 2 sk 2 2s 



2m B L M ~ C 2f~ 3;i J B ~ B M-c'B-bs 

So we simplify the denominator to M yielding an error of 



B M 2 — c'BM - 5Ms \ M2 

(Blocks for output, wing bands and core bands need separate blocks.) Altogether 
this yields the upper bound of 

ic 2 hk 2 | /fc 2 | hk 2 \ = (^ + (D (^ + b\\ k x k 2 



BM \B M 2 J \M \k x M 2 ) ) B 



4.4 Analysis of the Lower Order Terms of the Three Dimensional 
Row Layout 

The sweep shape is a square of m vertices per side, lying in an x 2 x 3 plane and 
shifted along x\. To need only a constant number of c = 9 output blocks the 
results are saved in a block aligned column layout, as in the two dimensional 
case. Assuming B > 2s one block has to be in internal memory per row of the 
sweep shape. In the worst case of a block aligned row layout, the 2s previous 
sweep shapes are also needed. As in two dimensions, the vertices in the core 
band are rearranged in internal memory while the vertices of the wing bands 
stay in row layout when the input is written back to external memory. The core 
band is a square band of m — 4s vertices per side. The wing bands consists of 
4-(m — 4s) -2s +4 -2s -2s = 2s- (m — 2s) rows in total. They can be subdivided into 
two classes. There are four wing bands each when two respectively four working 
bands overlap. They consists of (m — 4s) • 2s or 2s • 2s vertices each. So m has 
to fulfill 

+ 1 ) -B + 4-2s- (m-2s) B < M - cB . (18) 



m 



B + 



2s • (to - 4s) z 
B 
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Solving this quadratic equation, adding an additional +1 for dropping the ceiling 
function, yields 



16s 2 



isB± 



mi 2 



2- (B + 2s)- 



±>/(-16s 2 + 8sB) 2 - MB + 2s)(32s 3 + 2B- l6s 2 B - M + cB) 



Since to only has to suffice the upper bound from (18 1 we can estimate 



16s 2 - 8sB ± y/(-16s 2 + 8sB) 2 - MB + 2s)(32s 3 + 2B- 16s 2 B - M + cB) 

2 • (B + 2s) 

-8sB + ^A{B + 2s) (-32s 3 - 2B + M - cB) 
~ 2 • (B + 2s) - 



\[M - 32s 3 -2B-cB 



> 



4 s B 
VB+2s 



VB + 2s 



> 



So we choose 



VM - 32s 3 



2B - cB -7Tms 



VB + 2s 

For the analysis of the lower order terms we need the asymptotic behavior of m 
O ( i / 4f- ) . There are k J„ working bands in X2 direction and 



which is m 



m— 2s 



in X3 direction. The wing bands that take part in two working bands 

cause one non-compulsory I/Os for each working band they take part in. The 
wing bands that take part in four working bands cause 6 non-compulsory I/Os 
in total. We overestimate this by 2 non-compulsory I/Os for every working band 
they are in. Furthermore we account for another 2m 2 I/Os per working band 
to read and write the first and last block of each working band. Altogether the 
upper bound is 



fc 2 
m — 2s 



fc 3 
m — 2s 



2m z 



h 

B 



1 • 4 • (2s • (m - 4s) + 2 • 2s • 2s) 



This is now simplified. First, dropping the first two ceiling functions yields the 
two error terms 



r £ 3 


• ^2to 2 + ( 




+0 


to — 2s 




B 





1 -4 - (2s- (to -4s) + 2 -2s -2s) 



O 



and 



fc 2 
to — 2s 



2to 2 



1 • 4 ■ (2s ■ (to - 4s) + 2 ■ 2s ■ 2s) 



O 



'kik 2 
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(The last, incomplete working band in both directions.) Simplifying the third 
) ^ adds a] 

k 2 h 



term to ^ adds an error of 



m — 2s m — 2s 



(2m 2 + 2 • 4 • (2s • (m - 4s) + 2 • 2s • 2s)) = O {k 2 k 3 ) 



(Additional I/Os for each row when blocks go over two rows and, lower order, 
estimating blocks per row.) such that the upper bound reads 

k 2 k 3 ki , . . . . k 2 fc 3 fci „ 

— -4- (2s- (m -4s) + 2- 2s- 2s) = 6 -Sms . 

m — 2s m — 2s B K K ' ' m - 2s m - 2s B 



We drop the —2s terms which yields errors of 
2s 



k 2 



fc 3 k l o n 

■ — ■ 8ms — O 



m? — 2sm m — 2s B 

k 2 , 2s fci 

fc 3 • — o — ^ ' -r ■ 8ms = O 

m m z — 2sm B 



k\k 2 k 3 

M 
hk 2 k 3 

M 



and 



(Outer wing band cannot be evaluated.) As next step we substitute the value of 
m and get, using the concavity of the square root, 



< 



8s • kik 2 k 3 1 16s • k x k 2 k 3 
B m ~ B 

8s-fcifc 2 fc 3 VB + y/2s 

_ 



VM- 3 2s 3 - 2 B-cB fa-S— 

V B + 2s 

VB+2s 



< 



M - V32s 3 -2B-cB - -^§S= - V# + 2s 



Simplifying the enumerator yields an error of 

8s • kik 2 k 3 v^2s 
_ 



M - V32s 3 -2B-cB- - ^B + 2s 



O 



k\k 2 k 3 \ 

bVm J 



(2s previous sweep shapes are needed in memory.) and then simplifying the 
denominator 



8s • k\k 2 k 3 
B 



■ 4b- 



V32a3 _ 2B - cB + + ^B + 2s 



M - \[M (y/32s 3 -2B-cB- - ^JB + 2s 



= O 



felfc 2 fc; 

M 



(Blocks for output and for each part of the core and wing bands.) Hence the 
number of non-compulsory I/Os is upper bounded by 



8s • k\k 2 k 3 

4b4m 



O 



k\k 2 k 3 k\k 2 k 3 



M 



k\k 2 k 3 



BVM ) \J~B\fM 



8s + 



B 



M 



IB 
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4.5 Analysis of the Lower Order Terms of the Three Dimensional 
Block Aligned Column/Pole Layout 

The sweep shape is a square of to x m vertices lying in an x 2 .T 3 -planc shifted 
in X\ direction. As the sweep shape and sweep order are identical to the three 
dimensional row layout, the size of the wing and core bands are the same and up 
to c = 9 output blocks are needed. The column and pole layout are completely 
symmetric and the discussion is restricted to the column layout. As the vertices 
are saved in separate blocks for core bands and wing bands that are shared by 
different sets of working bands, m has to suffice 



2s ■ (to - 4s) 2 + (to - 4s) • s 



B 



+ 1 B+ 



+ 



(2s + 1) • (m -4s) • 2s' 
B 

(2s + 1) • (2s) 



+ 



B 



+ I ) ■ B • 4+ 

•/)'•!• M cB . 



The (to — 4s) • s part in the second term captures the vertices of the s.th pro- 
ceeding sweep shape that are also needed in internal memory in addition to 2s 
complete sweep shapes. Using d = c + 18 to take care of the ceiling functions 
and the corresponding +l's, solving the resulting quadratic equation yields that 
a solution is given by 



-9s 



v/sTs^ 



4 • (2s) • (-M - 20s 2 + c'B) 



> 



2 • 2s 

-9s + y/8s ■ (M - c'B) 

4s~ 



> 



So we choose 



TO : 



VM - c'B 



9y/s 

V2 



c'B 
V2s~ 



9Vf 
V2 



2.s 



which yields to = O (^/M^j for the analysis of the lower order terms. There 

are m ^ 2 2s working bands in x 2 direction and — j* 2s j in X3 direction. The 

vertices in the wing bands cause non-compulsory I/Os. One non-compulsory 
I/Os is caused by a vertex in a wing band shared by two working bands for each 
working band it belongs to. 6 non-compulsory I/Os are caused in total by the 
vertices in the intersection of four working bands. We estimate this by two non- 
compulsory I/Os for each participating working band. Hence the upper bound 
is 

i_ h 
2s m — 2s 

~ki(m - 4s) • 2s" 



B 



+ 1-4 + 



h ■ (2s) 
B 



2 1 



+ 1-4-2 
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We simplify this term by dropping the ceiling functions and the corresponding 
+1. This yields the error terms 



h 


id 


m — 2s 





k 2 

m — 2s 



k 2 



ki(m - 4s) ■ 2s 
B 



ki(m - 4s) • 2s' 
B 



1-4 



fci ■ (2s) 2 
B 



B 



1-4 



fci ■ (2s) 2 
B 



I J • 4 ■ 2 | = 

O 

I J - 4-2 

= O 



B 



and 



m— 2s m — 2s 



(2 • 4 + 2 • 4 • 2) = O 



M 



(The last working band in x 2 and £3 direction and blocks of wings of working 
bands that are at the end and beginning of a row.) such that the upper bound 
has simplified to 



k 2 



k 3 



m — 2s m — 2s 



fcr(m-4s)-2s /[ | h ■ (2s) 2 | o ( 



B 



fci 



to — 2s m — 2s B 



fc 2 



• 8sto . 



(Estimating the number of working bands. Outer part of wing bands cannot be 
evaluated.) Dropping the two —2s terms yields the error terms 



k 2 



2s 



k 3 k x 



m 2 — 2sm to — 2s B 



2s 



fci 



2sm B 



■ 8sm = O 



■ 8sm = O 



k\k 2 k 3 
BM 

hk 2 k 3 
BM 



and 



Substituting the value for m and using the concavity of the square root yields 

1 



k 2 k 3 ki 8s • k\k 2 k 3 

— . 8sm = 

to m B B 



•JM-c'B- 



V2_ 



lis 



< 



< 



8s • k\k 2 k 3 



2s 



B 



s/M-dB - ^ - \/2s 



< 



< 



8s • k\k 2 k 3 



12s 



V2 



B y/M-VdB 
The enumerator is simplified to \J~M yielding another error term of 



2s 



s 3 / 2 • k!k 2 k 3 
B 



f dB+ + V2~s 



M+VM 



n 9^ 



2s) 



Q f kik 2 k 3 \ 
fBM ) 
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(Full blocks for output and each part of the wing bands and the core band.) 
Hence the non-compulsory I/Os can be upper bounded by 



8V2 ■ s 3 / 2 ■ hk 2 k 3 | c / k 1 k 2 k 3 



kik 2 k 3 



8^2 • s 3 / 2 + O 



'B_ 
M 



4.6 Analysis of the Lower Order Terms of the Three Dimensional 
2D Block Aligned Diagonal Layout 

The sweep shape is a two dimensional i 1 ball of radius to lying in an a^a^-planc 
and shifted along x\. It consists of 2m 2 + 2m + 1 vertices. Within a sweep shape 
the vertices are saved in diagonals like in the two dimensional layout but in 
separate blocks according to core and wing bands that are shared by different 
working bands. As the wing bands are shared with four other working bands in 
up to eight different ways, c — 9 blocks have to be reserved for the output. Take 
the intersection of a working band with a X2X3-plane and denote with A, A' and 
A" the number of vertices in that intersection that are in the core band, in the 
wing band shared by two working bands and in the wing bands shared by four 
working bands respectively. It then holds that A + 4A' + AA" = 2m 2 + 2m + 1 
and 4A' + 4A" < 4 • 2s • (to +1). A is quadratic in to, A' is linear in to and s 
and A" solely depends on s and is independent of to. Hence m is limited by 



2s • A + s ■ (m+ 1) 
B 



+ 1 ■ B 



+ 



(2s +1) ■ A r 
B 

(2s + 1) ■ A" 
B 



lj ■ B ■ 4+ 

- 1 I ■ B ■ 4< M - cB . 



The s • (m+ 1) in the first term upper bounds the vertices of the s.th proceeding 
sweep shape. Estimating the ceiling functions and the corresponding +l's with 
d = c + 18 and solving the resulting quadratic equation yields an upper bound 
for to, namely 

-13s + t/(13s) 2 - 4 • (4s) • (lis + c'B - M) > y/M-lls-dB - ^ 



Hence we choose 



2 -4s 



m = 



y/M- lis - c'B - ^ 



2 • 



which has the asymptotic m = O Mj . As next step we estimate the number 

of working bands. While a quadratic working band is shifted by to — 2s in x 2 - 
and ^-direction, the I 1 ball working band can be shifted by 2to — 3s in x 2 
direction and to in x 3 direction. Note that the best offset of the working bands 
varies slightly with s and the core bands overlap for the two dimensional i 1 ball 
as sweep shape. But this does not affect the leading term of the non-compulsory 
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I/Os. Again, we account for two non-compulsory I/Os for all vertices belonging 
to four working bands. The upper bound is then estimated by 



k 2 



2m — 3s 



1 



m 



1 



B 



+ 1 -4 + 



ki ■ A" 
B 



+1-4-2 



Dropping the ceiling functions and the corresponding +l's yields the error terms 



2 ■ 



m 

2m - 3s 

k 2 h 
2m — 3s m 



fci ■ A' 
B 



-1 



fa ■ A" 



B 



1-4-2 



O 



fafa 
~B 



fa- A' 
B 



+ 1J -4 + 
(2 • 4 + 2 • 4 • 2) = 



fci ■ A" 
B 

k 2 k 3 
M 



+1-4-2 



O 



fak 2 
If 



and 



(The last working bands in x 2 - and ^-direction and the blocks of wing bands 
at the beginning and end of a row.) such that the upper bound reads 



k 2 _ fa 
2m — 3s m 



fci -A' fa ■ A" 

— 4+ — 

B B 



4- 2 < 



< 



k 2 



2m 



^3 fci 

3s' m' B 
k 2 



< 



(4 • A' + 8 • A") < 
fc 3 fci 



2m — 3s m _B 



(8s-(m + l) + G(s 2 )) . 



We drop the O (s 2 ) = O (1) term, the +1 from (m+ 1) and simplify the denom- 
inator of the first term to 2m. This yields the error terms 



k 2 



fc 3 fci 



fc 2 - 



2m — 3s m B 

k 2 h fci 
2m — 3s m B 
3s fc 3 fci 



4m 2 — 2m • 3s m B 



O (s 2 ) = O 
■ 8s = O 
■ 8sm = O 



kik 2 k 3 
BM 

kik 2 k 3 
BM 

k\k 2 k z 
BM 



and 



(Overestimating the non-compulsory I/Os caused by the wing bands part of 
four working bands. Overestimating the vertices in the wing bands. Estimating 
the number of working bands - outer part of wings cannot be evaluated.) Now 
substitute the value for m into the upper bound and use the concavity of the 
square root to get 



4s • kik 2 k 3 1 
B m 



< 



4s • k\k 2 k 3 
B 



4s • k\k 2 k 3 
B 



1 



VM-lls-c'B-i%5 

2-v^ 



< 



M — VlTs + c'B 



4 
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As last step, simplify the denominator to yM, yielding an error of 



(Whole blocks for output and each part of the wing bands and the core band.) 
Hence an upper bound for the number of non-compulsory I/Os is 

8s 3 / 2 ■ fcifc 2 fc 3 Q ( hk 2 k 3 \ = fcifc 2 fc 3 f g3/2 ( Vb\ \ 

B\[M \sfB'M) BVM \ S \VM J J 



4.7 Analysis of the Lower Order Terms of the Three Dimensional 
Hexagonal Aligned Diagonal Layout 

Let the algorithm for the hexagonal aligned diagonal layout as described in Sect. 
|3.2| work on the layout with different blocks for the core bands and wing bands 
shared by different working bands. Per working band, there are 12 different 
kinds of wing bands that are accessed so we need c = 13 additional output 
blocks. Consider a working band and its intersection with a plane of normal 
(1, 1, 1). Of this intersection denote with A the number of vertices that are in 
the core band, A' the vertices that are in wing bands shared by two working 
bands and A" the vertices of the wing bands that are shared by three working 
bands. We then have A + 6A' + 6A" = 3m 2 + 3m+l, the full sweep shape. While 
A depends on m quadratically, A' is linear in m and s and A" solely depends 
(quadratically) on s. For a constant 1 < d < 2 accounting for the vertices in the 
2s + l.th sweep shape that have to be in internal memory also, m has to fulfill 



2s A + d ■ ms 
B 



15- 



(2s + I) A 1 
B 

(2s + I) A 
B 



I | 6 • B+ 

1 \ 6-B< M-cB . 



Dropping the ceiling functions and the corresponding +ls in the constant d we 
have to obey 

M - c'B > 2sA + dms + 6(2s + l)A' + 6(2s + I) A" = 

= 2s(3m 2 + 3m + 1) + dms + 6 A' + 6A" . 

Since A' is linear in m and s and A" quadratic in s there are constants d! and 
d" such that we need to sovle 

2s(3m 2 + 3m + 1) + d'ms + d" s 2 - M + c'B = . 

We can restrict ourselves to the positive solution of the quadratic equation. 



rn 



-(6s + d's) + y/(2s ■ 3 + d's) 2 - 4 • 6s • (2s + d"s 2 - M + c'B) 

2 • 6s 
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Since we just need an upper bound for to, this equation can be simplified to, 
using that all involved constants are positive, 



-(6s + d's) + y/(2s ■ 3 + d's) 2 - 4 • 6s • (2s + d"s 2 - M + c'B) 



2 • 6s 



> 



> 



-(6s + d's) + a/-4 • 6s • (2s + d" s 2 — M + c'B) 
2~ 6s 

_§ f^r + y/ Ms ~ s ( 2s + d " s2 + C ' B ^ 



Hence we choose 



^/Ms - s(2s + d"s 2 + c'B) - 

V^ 2 



As next step the number of working bands needs to be estimated. Therefore 
consider the intersection of a working band with an xia^-planc. The intersection 
of the working band with the plane consists of 3(3m 2 + 3to + 1) vertices. Parts of 
the wing bands cannot be evaluated but as the wing bands are linear in to and 
s there is a constant s such that 9m 2 + 9to + 1 — ems > 9m 2 — ems vertices of 
these intersection can be evaluated. The width (height) of this intersection, the 
distance between the leftmost and rightmost (lowermost and uppermost) vertex, 
is bounded by 7m. Hence enlarging the grid by 7m vertices in positive and 
negative x 2 - and a; 3 -direction ensures that we get a lower bound for the number 
of working bands in each xia^-plane when we divide the vertices per plane by 
the vertices which can be evaluated per working band. Hence the number of 
working bands is estimated by 

(fc 2 + 14m) (k 3 + 14m) 



9m 2 



ems 



In the intersection of a working band with an a; 1 x 2 -plane there are at most 
24 (to + l)s wing vertices. These divide themselves into wing vertices that are 
shared by two or three working bands. Denote with C the number of wing vertices 
that arc shared by two working bands and C the vertices that are shared by 
three such that 6C + 6C" < 24(m + l)s. C depends linear on to and s and C" 
depends on s quadratically. The number of I/Os is then given by 



(fc 2 + 14TO)(fc 3 + 14TO) 
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The term including C" is multiplied by 2 so that we always pay a read and 
write operation for these vertices. This is too pessimistic since we do not take 
advantage of the compulsory I/Os. 

We can now start simplifying. First drop one of the terms including C , 
namely 



(k 2 + 14m) (k 3 + 14m) 
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(Wing bands shared by three working bands.) Then drop the ceiling functions 
and the corresponding +ls to get an error of 

(fc 2 + 14m)(fc 3 + 14m) u = Q /Mfe 
9m 2 — ems \ M 

(Blocks at the end and beginning of the wing bands.) The upper bound now 
reads 

(k 2 + 14m)(fc 3 + 14m) f & k x C | ^C" 



9m 2 — ems \ B B 

< (k 2 + Um)(k 3 + 14m) 24(m + l)s 

9m 2 — ems B 1 

Dropping an 24s from the enumerator of the second term gives an error of 
(fc 2 + 14m)(fc 3 + 14m) 24s _ ( ' k x k 2 k 3 



9m 2 - ems B \ BM 

(Estimation of wing size.) Let us first simplify the denominator to 9m 2 which 
yields an error of 

(k 2 + 14m)(fc 3 + 14m) 24ms 1 es _ ( k x k 2 k 3 



1 B m 81m 2 - 9ems \ BM 

(Estimation of m and wings of 2s+l.th layer.) Now the grid sizes can be changed 
back to k 2 and k 3 gathering the error terms 

k 3 + 14m 24ms / hk 3 \ 

14m — - — = ^—ki = O = and 

9m 2 B \bVMJ 

k 2 24ms /" fcifc 2 

14m — — — k\ = O 



9m 2 B \ By/M 

(Estimation of the number of working bands.) We are now ready to substitute 
the value for m and simplify using the concavity of the square root: 

hk 2 k 3 24ms 2Ak x k 2 k 3 1 
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As last step, simplify the denominator to \J~M~s and collect an error of 

8sfcifc2fc3 



3B 



y/s(2s + d"s 2 + c'B) + + VQs2 
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(Estimation of sweep shape size. Taking into account full blocks for each part 
of the core and wing bands, difference between working bands and vertices that 
can be evaluated in a working band.) The error terms simplify to 



O 



k\k 2 k 3 k x k 2 k x k 2 k x k 2 k 3 \ _ q( k\k 2 k 3 \ 
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such that the final upper bound is 

8V2 ■ s 3 / 2 hk 2 k 3 | Q f hk 2 k 3 ^ = ( 8V2-s 3 / 2 
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4.8 Analysis of the Lower Order Terms of the Arbitrary 
Dimensional Block Aligned Column Layout 

This section analyzes the number of non-compulsory I/Os for the algorithm 
sweeping an n — 1 dimensional hypercube of side lengths m in xi-direction in a 
block aligned column layout. The dimension n > 3 as well as the stencil size s is 
held fixed, hence both are O (1). The number of subsets of working bands that 
overlap with a fixed working band solely depends on the dimension. Hence the 
number of blocks needed for the output is O (1). As input 2s sweep shapes of 
dimension n — 1 are needed as well as about s + 1 hypercubes of dimension n — 2 
of the s.th proceeding sweep shape. An additional O (B) vertices are needed to 
account for complete blocks. Altogether, m has to suffice 

2s • m™- 1 + O (to™- 2 ) + O (B) < M - O (B) . 
In the asymptotic analysis this becomes 

2s • TO™" 1 + (m™- 2 ) =2s-(to + C(1))"- 1 , 
such that we can solve 

For i 6 {2, ... , n} the number of working bands in direction Xi is bounded by 
J^ 2s ■ The working bands extend in ^-direction. The 2s outermost layers of a 
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working band are wing bands and 2(n — 1) sides of a working band are covered 
by wing bands. For each layer of wing bands and each side of a working band, 
the intersection with a hypcrplanc of normal X\ is a subset of a n — 2 dimensional 
hypercube of side length to. 

Consider such a hyperplane of normal x\ for the following discussion. The 
number of points in the intersection of two adjacent working bands is of order 
m"~ 2 and the vertices which belong only to these two working bands cause 
exactly 2 non-compulsory I/Os, one for each working band they are part of. 
When the intersection three working bands is regarded the number of points 
is of order to™~ 3 and vertices in only these three working bands cause 4 non- 
compulsory I/Os. In general, the number of vertices in more than two working 
bands is of order m n ~ 3 or less and the number of I/Os necessary for these vertices 
solely depends on the number of dimensions. Hence the I/Os caused by vertices 
in wing bands that are part than more than two working bands are of order 

We also assume that another O (B) vertices cause I/Os to account for in- 
complete blocks of wing bands. Hence an upper bound is given by 



n 

i=2 



m — 2s 



' (2s ■ 2(n - 1) • m"- 2 + O (to™- 3 )) ■k 1 +0{B) 



B 



This bound is simplified by dropping first the O (B) and then the O (to™ 3 ) 
term, yielding errors of 




(Error for incomplete blocks in the wing bands. More than two non-compulsory 
I/Os for vertices that are part of more than two working bands.) Dropping the 
ceiling functions then, adds an error of at most 

(Last, incomplete working band in the last n — 1 directions.) Simplifying the 
denominator to to is bounded by 



O 



nr=ifci 



(The outer parts of the wing bands cannot be evaluated.) Substituting to, the 
upper bound now reads 
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Dropping the O (1) term adds an error of at most 



O 



B n ~yW 



(Vertices needed of the s.th proceeding sweep shape.) and then the concavity of 
the root can be applied to yield 



4s • (n - 1) • rnu h 
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4s-(n-l)-nr=i^'(2s) 1/(ri - 1) 
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Finally dropping the term n ~{/0 (B) yields an error of 

nr=i^ ■ i'D \ _ „i ii i ''''' 
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"-\/B™-2 • M 2 



(Blocks for output, whole blocks for core band and each part of the wing bands.) 
and the upper bound 

4-2^ fl ^.(n-l)-nr =1 fc i 



B n_ \/M 

For n > 3, the last error term is dominant such that the upper bound including 
error terms is 



4-2^s^t . (n-1) -n?=i*i 
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